Chromosome-level genome assembly of Guide Black-Fur sheep (Ovis aries)

Guide Black-Fur sheep (GD) is a breed of Tibetan sheep (Ovis aries) that lives in the Qinghai–Tibetan plateau region at an altitude of over 4,000 m. However, a lack of genomic information has made it difficult to understand the high-altitude adaptation of these sheep. We sequenced and assembled the GD reference genome using PacBio, Hi-C, and Illumina sequencing technologies. The final assembled genome size was 2.73 Gb, with a contig N50 of 20.30 Mb and a scaffold N50 of 107.63 Mb. The genome is predicted to contain 20,759 protein-coding genes, of which 98.42 have functional annotations. Repeat elements account for approximately 52.2% of the genomic landscape. The completeness of the GD genome assembly is highlighted by a BUSCO score of 93.1%. This high-quality genome assembly provides a critical resource for future molecular breeding and genetic improvement of Tibetan sheep.


Background & Summary
Tibetan sheep play a unique and essential role in China's rural revitalization strategy.High-quality development of the Tibetan sheep industry is expected to improve social and economic development while maintaining the ecosystem stability and national security of the Qinghai-Tibetan Plateau.Germplasms of these sheep are considered valuable because of their unique biological characteristics, high genetic stability, stress resistance, meat quality, and carpet wool quality [1][2][3] .Additionally, having undergone long-term natural and artificial selection under specific ecological conditions, they are a model species for studying adaptations to extreme environments.Therefore, assembly of the Tibetan sheep reference genome at the chromosome level will substantially facilitate comparative and functional genomics research.Although draft genome assemblies of certain sheep species have been released, their quality metrics were low [4][5][6][7][8] .Furthermore, no comprehensive genomes of sheep from high-altitude environments (>4,000 m) are available.
Guide Black-Fur sheep (GD) are known for their black skins and spend most of the year grazing in their core habitat of Guide County, Hainan Tibetan Autonomous Prefecture of Qinghai, China (altitude ~4,100 m).GD sheep exhibit a variety of fleece colors, ranging from blackish-red to grey.Black skin contains more melanocytes than other skin colors and thus synthesizes more melanin, absorbing ultraviolet radiation and thereby reducing its damaging effects on organs 9,10 .This adaptation strategy has allowed GD to thrive in a robust ultraviolet environment.
In this study, we aimed to produced high-quality, chromosome-scale, de novo genome assemblies of the high-altitude environments GD using Illumina, PacBio, and chromosome conformation capture (Hi-C) sequencing.Comparisons with other mammalian genomes helped to identify rapidly evolving genes and elucidate the evolutionary history of Tibetan sheep.

Methods
Sample collection and sequencing.All animal experiments were performed under the guidance of ethical regulations from the Institutional Animal Care and Use Committee of Lanzhou Institute of Husbandry and Pharmaceutical Science of Chinese Academy of Agricultural Sciences (Approval No. NKMYD201805; Approval Date: 18 October 2018).We randomly selected one adult male GD (age: 3 years) from flocks raised under standard feeding regimes and with free access to water in Hainan Tibetan Autonomous Prefecture, Qinghai, China (altitude: ~4,100 m).Blood samples were collected from the jugular vein and then preserved in EDTA anti-freezing tubes at −20 °C.The animals was slaughtered by exsanguination after being deprived of food for 24 h.The brain, heart, liver, kidney, spleen, lung, and muscle tissues were excised within 30 min and stored in liquid nitrogen.
For PacBio long-read and short-read sequencing, genomic DNA was extracted from the blood and liver of the GD.The DNA was sequenced at Frasergen Bioinformatics (Wuhan, China) using the PacBio Sequel platform (Pacific Biosciences), yielding a total of 267.82 Gb of PacBio continuous long reads for corresponding to 98 × genomic coverage (Table 1).The DNA was re-sequenced using HiSeq X-Ten (Illumina, CA, USA) to correct long reads.Paired-end sequencing produced 263.48 Gb of short-read data corresponding to 96 × genomic coverage (Table 1).
The PacBio Iso-Seq method sequenced full-length transcripts via Single Molecule Real-Time (SMRT) sequencing technology.Total RNA was extracted from all tissues (brain, femur, spleen, back muscle, kidney, heart, lung, and liver) using the RN33 kit (Aidlabs Biotechnologies, Beijing, China).All required profiles were obtained using the same sequencing platform described for PacBio, producing 36.90Gb of Iso-Seq reads (Table 1).
For Hi-C sequencing, fresh liver tissues were collected for library construction, which was performed as previously described 11 .Paired-end (2 × 150 bp) sequencing was conducted on the Illumina HiSeq X-Ten platform to abtain 494.40 Gb of data corresponding to 181 × genomic coverage (Table 1).
Genome assembly and polishing.All subreads from SMRT sequencing were used for de novo assembly of the GD genome.The draft genome assembly was obtained using MECAT2 with default parameters 12 .Variant calling with gcpp in the SMRT link 4 toolkit was performed to correct errors after the initial genome assembly.Next, initial assembly contigs from the previous step were mapped with corresponding HiSeq reads and polished once using Pilon (v1.22) to correct any remaining errors 13 .This gave a total of 1,972 primary contigs for GD, with an N50 size of 20.30 Mb (Table 2).
Hi-C scaffolding.Next-generation sequencing short reads were aligned to their assembly using the Burrows-Wheeler Aligner (BWA-MEM algorithm, v0.7.17) to further increase single base accuracy 14 .Purge Haplotigs was used to filter redundant sequences resulting from genome heterozygosity 15 .Pseudochromosomes were dcreated during Hi-C analysis, as described previously 16 .Briefly, all clean read pairs produced from the Hi-C library were mapped to the polished sheep contig assembly using the BWA-MEM algorithm with default parameters.LACHESIS was then used to cluster contigs into chromosome-level scaffolds based on the genomic proximity signal of Hi-C data 17 .After scaffolding, we obtained 27 chromosomes containing > 98% of the contigs (Table 3, Fig. 1).The final assembled genome (2.73 Gb) had a scaffold N50 of 107 Mb and 1,246 gaps in total (Table 2).annotation of repetitive sequences and genes.De novo and homology-based prediction methods annotated repeat sequences in the GD genome.Known transposable elements were identified by combining RepeatMasker, RepeatProteinMask, and RepeatModeler 18 .Repetitive transposable element (TE) sequences comprised ~52% of the total assembly of the genome, with long terminal repeat retrotransposons being the most abundant (~20.99%)(Table 4).
For gene annotation, we adopted a strategy combining ab initio gene finding, homology-based gene prediction, and Iso-Seq reads.First, the assembled GD genome was hard-and soft-masked by RepeatMasker.Second, ab initio gene prediction was performed using the Augustus (v3.3.1) and SNAP gene models, which were pre-trained using homologous proteins 19 .Third, Exonerate (v2.2.0) set to default parameters was used to predict genes from protein sequences 20 .Fourth, clean RNA-sequencing reads were assembled into transcripts via Trinity for RNA-based gene prediction, followed by further prediction of gene structure using PASA 21 .Finally, Maker (v3.00) was employed to integrate the three sets of prediction results 22 .In total, we identified 20,759 predicted protein-coding genes within the genome, representing 98.42% of all genes (Table 5).A comparison of gene features among GD, Texel sheep (Ovis aries) 23 , Rambouillet sheep (Ovis aries) 24 , San Clemente goats (Capra hircus) 25 , and humans (Homo sapiens) 26 revealed similar length distributions for coding sequences, genes, exons, and introns (Fig. 2).The tRNA-related genes were mainly identified by tRNAscan-SE (v1.3.1) and Infernal (v1.1.2) using default parameters 18,27 .A total of 256,815 non-coding RNA genes were predicted, comprising 528 miRNA, 231 rRNA, 264,021 tRNA, and 2,017 snRNA genes (Table 6).

Data Records
Raw data from the long-read and short-read sequencing have been deposited in the NCBI Sequence Read Archive database with accession numbers SRR22290763 (Iso-seq) and SRR22585187 (whole-genome sequencing) under BioProject PRJNA898852 28,29 .The final chromosome assembly was deposited in the GenBank at NCBI with accession number: JBEJUG010000000 30 .The draft genome assembly and genome annotation were deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.26013145) 31.

Technical Validation
evaluation of the genome assembly.Quality metrics included assessment of the completeness of each genome based on the proportion of single-copy conserved orthologs obtained and specific signs of mis-assembly.Benchmarking Universal Single-Copy Orthologs (BUSCO) software 32 revealed that ~93% of the conserved genes were identified in the GD genome, confirming the completeness of the obtained assemblies.The 93.1% of single-copy conserved orthologs in this assembly was higher than those found in Hu sheep and Texel sheep 23,33 .The percentage of duplicated complete BUSCOs detected in the GD genome (1.9%) was comparable to the ranges  observed across the genomes of Texel sheep and Rambouillet sheep (0.9%-1.6%) 23,24 .The GD genome also exhibited a lower percentage of missing BUSCOs (4.2%) compared with the Texel sheep genome (11.1%) 23 , suggestive of genomic integrity and the precise assembly of highly complex regions (Table 7).

Genome collinearity analysis.
Genome synteny analysis of the GD, Texel, and Rambouillet sheep breeds was performed using the MUMmer tool with default parameters (filtering of delta sequences with the -1 parameter and removal of collinear fragments <10 kb).Single-nucleotide polymorphisms and variations between the three genomes were identified using the show-snps (-rT parameter) and show-diff (-rH parameter) utilities, respectively.Overall, the GD genome demonstrates strong collinearity with Texel and Rambouillet sheep.Some obvious insertions and inversions on chromosomes 1, 3, and 7 may be species-specific (Fig. 3).

Fig. 1
Fig. 1 Hi-C interaction heatmap for the GD genome.Color shading from light to dark indicates the increase in interaction intensit.

Fig. 2
Fig.2Comparison of gene features among the genomes of GD and four other animal species.Gene features include the lengths of genes, CDS, exons, and introns.

Table 2 .
Assembly statistics of Guide black fur sheep.

Table 3 .
Summary of the Hi-C grouping results in each pseudo chromosome.

Table 5 .
Functional annotation statistics of the GD genome.Note: GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; NR, non-redundant.

Table 6 .
Information on non-protein-coding genes identified in the GD genome.